Let’s investigate our data to see what the numbers can tell us about eating fish and eating sugar and how both relate to happiness scores and incidences of anxiety & depressive disorders across many countries around the world.

Remember: not all countries were represented in each data set collected, and when refining some were lost. This means not all countries are accounted for in this investigation, though 150 countries is no small number


Initial imports

First we’ll import the libraries we’re going to use.

shh <- suppressPackageStartupMessages
shh(library(tidyverse))
shh(library(broom.mixed))
shh(library(lme4))
shh(library(ggeffects))
shh(library(performance))
shh(library(plotly))

Next we’ll import our data.

You’ll notice that the data has already been cleaned and arranged in a way that makes the following operations straightforward. The original data files are available from their sources (linked on the home page), or in the github repository.

data <- tibble(read_csv('../data/happy_sad_sugar_fish.csv')) |> 
  arrange(country)
## Rows: 1600 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): country
## dbl (5): year, happiness_score, fish_kg_per_person_per_year, sugar_g_per_per...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
## # A tibble: 1,600 × 6
##    country    year happiness_score fish_kg_per_person_p…¹ sugar_g_per_person_p…²
##    <chr>     <dbl>           <dbl>                  <dbl>                  <dbl>
##  1 Afghanis…  2008            37.2                   0.08                   23.4
##  2 Afghanis…  2009            44                     0.07                   23.3
##  3 Afghanis…  2010            47.6                   0.07                   24.2
##  4 Afghanis…  2011            38.3                   0.07                   24.8
##  5 Afghanis…  2012            37.8                   0.07                   25.4
##  6 Afghanis…  2013            35.7                   0.07                   24.5
##  7 Afghanis…  2014            31.3                   0.19                   34.7
##  8 Afghanis…  2015            39.8                   0.21                   32.6
##  9 Afghanis…  2016            42.2                   0.23                   29.9
## 10 Afghanis…  2017            26.6                   0.25                   32.6
## # ℹ 1,590 more rows
## # ℹ abbreviated names: ¹​fish_kg_per_person_per_year,
## #   ²​sugar_g_per_person_per_day
## # ℹ 1 more variable: pct_new_per_pop_disorders <dbl>

Modeling the data

Linear mixed effects regression models are used when not all observations are independent incorporating variables that have both fixed effects and random effects.

In our case, the fixed effects predictors are fish_consumption, sugar_consumption, and year. The random effects predictor is country. The target variable is either the happiness_score or the sadness_score.

We’ll use the lmer package to evaluate.

mixed_model <- lmer(
  happiness_score ~ year + sugar_g_per_person_per_day + 
    fish_kg_per_person_per_year + (1 | country), 
  data = data)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## happiness_score ~ year + sugar_g_per_person_per_day + fish_kg_per_person_per_year +  
##     (1 | country)
##    Data: data
## 
## REML criterion at convergence: 9489.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.5305 -0.5675  0.0132  0.5632  4.7850 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  country  (Intercept) 91.97    9.590   
##  Residual             14.87    3.856   
## Number of obs: 1600, groups:  country, 150
## 
## Fixed effects:
##                              Estimate Std. Error t value
## (Intercept)                 -0.048114  56.500736  -0.001
## year                         0.025223   0.028171   0.895
## sugar_g_per_person_per_day   0.017764   0.004474   3.971
## fish_kg_per_person_per_year  0.113243   0.036758   3.081
## 
## Correlation of Fixed Effects:
##             (Intr) year   s_____
## year        -1.000              
## sgr_g_pr___  0.325 -0.331       
## fsh_kg_p___  0.098 -0.108 -0.012

Random Effects

Random effects:
 Groups   Name        Variance Std.Dev.
 country  (Intercept) 91.97    9.590   
 Residual             14.87    3.856   
Number of obs: 1600, groups:  country, 150

The variance between countries (\(91.97\)) is much larger than the residual variance, suggesting that there are substantial differences in baseline happiness between countries.

  • this means that the difference in country accounts for much more of the variation in happiness score than the unexplained variation within countries.

  • the ratio \(91.97/(91.97 + 14.87) = 0.86\) suggests that around \(86%\) of the variation in happiness scores (after accounting for the fixed effects) is explained by differences in baseline happiness by country.

Fixed Effects

Fixed effects:
                             Estimate Std. Error t value
(Intercept)                 -0.048114  56.500736  -0.001
year                         0.025223   0.028171   0.895
sugar_g_per_person_per_day   0.017764   0.004474   3.971
fish_kg_per_person_per_year  0.113243   0.036758   3.081

We gain some interesting insight when looking at the fixed effects. There is a positive association between year, sugar_consumption, and fish_consumption.

The \(t-values\) (calculated as \(t = Estimate / Std. Error\)) will tell us what is statistically significant and what is not. As a rule of thumb, when using mixed effects models:

  • $t$ > 2 suggests statistical significance at approximately the \(p = 0.05\) level.

  • $t$ > 2.6 suggests significance at approximately the 0.01 level.

  • $t$ > 3.3 suggests significance at approximately the 0.001 level.

Based on the \(t\) values, we conclude that year is not statistically significant, fish_consumption is very statistically significant, and sugar_consumption is highly statistically significant.

# Extract predictions while varying one predictor at a time
effects_year  <- ggpredict(mixed_model, terms = "year")  
effects_sugar <- ggpredict(mixed_model, terms = "sugar_g_per_person_per_day")
effects_fish  <- ggpredict(mixed_model, terms = "fish_kg_per_person_per_year")

# Plot each effect separately
plot1 <- ggplot(effects_year, aes(x = x, y = predicted)) +
  geom_line(color = "blue") +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2) +
  labs(title = "Effect of Year on\nHappiness Score", x = "Year", y = "Predicted Happiness Score")

plot2 <- ggplot(effects_sugar, aes(x = x, y = predicted)) +
  geom_line(color = "red") +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "red") +
  labs(title = "Effect of Sugar Consumption\non Happiness Score", x = "Sugar (g per person/day)", y = "Predicted Happiness Score")

plot3 <- ggplot(effects_fish, aes(x = x, y = predicted)) +
  geom_line(color = "orange") +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, fill = "green") +
  labs(title = "Effect of Fish Consumption\non Happiness Score", x = "Fish (kg per person/year)", y = "Predicted Happiness Score")

# Arrange plots in a grid
library(patchwork)
plot1 + plot2 + plot3

This is a caterpillar plot or forest plot. it visually represents the estimated random intercepts for each country.

  • Values to the right of zero (positive values): Countries with positive random effect estimates (points to the right of the red dashed line) have a higher intercept than the average intercept in the model. In simpler terms, after accounting for year, sugar_g_per_person_per_day, and fish_kg_per_person_per_year, these countries tend to have a higher baseline happiness score compared to the average country in this dataset.

  • Values to the left of zero (negative values): Countries with negative random effect estimates (points to the left of the red dashed line) have a lower intercept than the average intercept. They tend to have a lower baseline happiness score than the average country, even after considering the fixed effects.

  • Values close to zero: Countries with random effects close to zero have baseline happiness scores that are very close to the average baseline happiness score, after accounting for the fixed effects. Their happiness is already well-explained by the fixed predictors in the model, and there’s not much country-specific deviation from the average intercept needed.

# Get random effects
ranef_df <- ranef(mixed_model)$country %>%
  as.data.frame() %>%
  rownames_to_column("country") %>%
  rename(effect = "(Intercept)") %>%
  arrange(effect)

# Plot random effects with confidence intervals
ggplot(ranef_df, aes(y = reorder(country, effect), x = effect)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(x = "Random Effect Estimate", y = "Country",
       title = "Country Random Effects with 95% CI")

# Residuals vs. fitted
augmented_data <- augment(mixed_model)
ggplot(augmented_data, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  theme_minimal() +
  labs(x = "Fitted values", y = "Residuals",
       title = "Residuals vs Fitted Values")
## `geom_smooth()` using formula = 'y ~ x'

# QQ plot of residuals
ggplot(augmented_data, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  theme_minimal() +
  labs(title = "Normal Q-Q Plot of Residuals")

# Convert ggplot to interactive plot
p <- ggplot(ranef_df, aes(y = reorder(country, effect), 
                         x = effect,
                         text = paste("Country:", country,
                                    "\nEffect:", round(effect, 2)))) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = "dashed", color = "red") +
  theme_minimal() +
  labs(title = "Country Random Effects with 95% CI",
       x = "Random Effect Estimate", y = "Country")

ggplotly(p, tooltip = "text")